
Handle cloning of objects larger than 4GB on Windows #2102

Open

dscho wants to merge 6 commits into gitgitgadget:master from dscho:fix-large-clones-on-windows

Conversation

Member

dscho commented Apr 28, 2026

On Windows, unsigned long is 32-bit even on 64-bit systems. This
causes multiple problems when Git handles objects larger than 4GB.
This patch series is a very targeted fix for a very early part of the
problem: it addresses the most fundamental truncation points that
prevent a >4GB object from surviving a clone at all.

Specifically, this fixes:

  • zlib's uLong wrapping and triggering BUG() assertions in the
    git_zstream wrapper
  • Object sizes being truncated in pack streaming, delta headers, and
    index-pack/unpack-objects
  • pack-objects re-encoding reused pack entries with a truncated
    size, producing corrupt packs on the wire

Many other code paths still use unsigned long for object sizes
(e.g., cat-file -s, object_info.sizep, the delta machinery) and
will need their own conversions. This series does not attempt to fix
those.

Based on work by @LordKiRon in git-for-windows#6076.

The last two commits add a test helper that synthesizes a pack with
a >4GB blob and regression tests that clone it via both the
unpack-objects and index-pack code paths using file://
transport.

cc: Derrick Stolee stolee@gmail.com
cc: Torsten Bögershausen tboegi@web.de
cc: Jeff King peff@peff.net

dscho added 6 commits April 28, 2026 01:13

When unpacking objects from a packfile, the object size is decoded
from a variable-length encoding. On platforms where unsigned long is
32-bit (such as Windows, even in 64-bit builds), the shift operation
overflows when decoding sizes larger than 4GB. The result is a
truncated size value, causing the unpacked object to be corrupted or
rejected.

Fix this by changing the size variable to size_t, which is 64-bit on
64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
space.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

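The decoding in question is a 7-bit variable-length integer. A minimal sketch of the fixed arithmetic (illustrative names, not Git's exact `unpack_object_header_buffer()`):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch of pack object-header decoding: the first byte
 * carries the object type and the low 4 size bits; each continuation
 * byte adds 7 more bits. Doing the shift on a size_t (instead of a
 * 32-bit unsigned long) is what keeps sizes above 4GB intact.
 * Caller must guarantee the buffer holds a complete header.
 */
static size_t decode_object_header(const unsigned char **p, int *type)
{
	const unsigned char *buf = *p;
	unsigned char c = *buf++;
	size_t size = c & 0x0f;
	int shift = 4;

	*type = (c >> 4) & 0x07;
	while (c & 0x80) {
		c = *buf++;
		size += (size_t)(c & 0x7f) << shift;	/* 64-bit shift */
		shift += 7;
	}
	*p = buf;
	return size;
}
```

With a 32-bit `unsigned long`, the `<< shift` wraps as soon as `shift` reaches 32; the `(size_t)` cast forces the arithmetic into 64-bit space on 64-bit builds.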
On Windows, zlib's `uLong` type is 32-bit even on 64-bit systems. When
processing data streams larger than 4GB, the `total_in` and `total_out`
fields in zlib's `z_stream` structure wrap around, which caused the
sanity checks in `zlib_post_call()` to trigger `BUG()` assertions.

The git_zstream wrapper now tracks its own 64-bit totals rather than
copying them from zlib. The sanity checks compare only the low bits,
using `maximum_unsigned_value_of_type(uLong)` to mask appropriately for
the platform's `uLong` size.

This is based on work by LordKiRon in git-for-windows#6076.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

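The masking idea can be sketched like this (a simplified stand-in; `ULONG32_MAX` plays the role of `maximum_unsigned_value_of_type(uLong)` on an LLP64 platform):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch: `total` is the wrapper's own 64-bit running count,
 * `z_total` models zlib's uLong counter, which wraps at 2^32 on LLP64
 * systems. The sanity check only compares the bits zlib can actually
 * represent; ULONG32_MAX stands in for
 * maximum_unsigned_value_of_type(uLong) on such a platform.
 */
#define ULONG32_MAX 0xffffffffUL

static int totals_consistent(uint64_t total, uint32_t z_total)
{
	return (total & ULONG32_MAX) == ((uint64_t)z_total & ULONG32_MAX);
}
```

A 5GB stream makes the 32-bit counter wrap, yet the low-bits comparison still succeeds, which is exactly what stops the `BUG()` from firing.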
The odb_read_stream structure uses unsigned long for the size field,
which is 32-bit on Windows even in 64-bit builds. When streaming
objects larger than 4GB, the size would be truncated to zero or an
incorrect value, resulting in empty files being written to disk.

Change the size field in odb_read_stream to size_t and introduce
unpack_object_header_sz() to return sizes via size_t pointer. Since
object_info.sizep remains unsigned long for API compatibility, use
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

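The temporary-variable pattern can be illustrated as follows; here `uint32_t` stands in for Windows' 32-bit `unsigned long` so the truncation is visible on any platform, and the struct and function names are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of bridging a size_t-producing decoder into a legacy API field
 * that is effectively 32 bits wide on LLP64 (Windows). uint32_t models
 * that platform's unsigned long; names are illustrative, not Git's.
 */
struct legacy_info {
	uint32_t *sizep;	/* legacy field: truncates >4GB values */
};

static void fill_legacy_size(struct legacy_info *oi, size_t real_size)
{
	if (oi->sizep)
		/* NEEDSWORK: lossy down-cast for objects >4GB */
		*oi->sizep = (uint32_t)real_size;
}
```

Keeping the full value in a `size_t` temporary and down-casting only at the API boundary is what lets the internal code paths stay correct while the legacy field awaits its own conversion.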
The delta header decoding functions return unsigned long, which
truncates on Windows for objects larger than 4GB. Introduce size_t
variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
preserve the full 64-bit size, and use them in packed_object_info()
where the size is needed for streaming decisions.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

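The shape of the change can be sketched as follows. The function names mirror the series, but the bodies are illustrative; the real wrapper uses `cast_size_t_to_ulong()`, which dies on overflow rather than clamping as done here:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Sketch of the pattern: implement the decoder once in 64-bit space,
 * keep the old unsigned-long entry point as a thin down-casting
 * wrapper. Bodies are illustrative, not Git's exact code.
 */
static size_t get_delta_hdr_size_sz(const unsigned char **datap,
				    const unsigned char *top)
{
	const unsigned char *data = *datap;
	size_t size = 0;
	int shift = 0;
	unsigned char cmd;

	do {
		cmd = *data++;
		size |= (size_t)(cmd & 0x7f) << shift;
		shift += 7;
	} while ((cmd & 0x80) && data < top);
	*datap = data;
	return size;
}

static unsigned long get_delta_hdr_size(const unsigned char **datap,
					const unsigned char *top)
{
	size_t size = get_delta_hdr_size_sz(datap, top);
	return size > ULONG_MAX ? ULONG_MAX : (unsigned long)size;
}
```

Callers that have been converted use the `_sz` variant directly; everyone else keeps the old signature and only pays the truncation cost they already had.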
To test Git's behavior with very large pack files, we need a way to
generate such files quickly.

A naive approach using only readily-available Git commands would take
over 10 hours for a 4GB pack file, which is prohibitive.

Side-stepping Git's machinery and actual zlib compression by writing
uncompressed content with the appropriate zlib header makes things
much faster. The fastest method using this approach generates many
small, unreachable blob objects and takes about 1.5 minutes for 4GB.
However, this cannot be used because we need to test git clone, which
requires a reachable commit history.

Generating many reachable commits with small, uncompressed blobs takes
about 4 minutes for 4GB. But this approach 1) does not reproduce the
issues we want to fix (which require individual objects larger than
4GB) and 2) is comparatively slow because of the many SHA-1
calculations.

The approach taken here generates a single large blob (filled with NUL
bytes), along with the trees and commits needed to make it reachable.
This takes about 2.5 minutes for 4.5GB, which is the fastest option
that produces a valid, clonable repository with an object large enough
to trigger the bugs we want to test.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

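The "uncompressed content with the appropriate zlib header" trick amounts to emitting RFC 1951 *stored* blocks inside an RFC 1950 wrapper. A hedged, self-contained sketch (not the actual test-tool code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of writing zlib-wrapped stored (uncompressed) deflate data,
 * which avoids running the compressor when synthesizing huge blobs.
 * Framing follows RFC 1950/1951. `out` must be large enough:
 * len + 5 bytes per 64KB block + 6 bytes of wrapper.
 */
static uint32_t adler32_update(uint32_t adler, const unsigned char *buf,
			       size_t len)
{
	uint32_t a = adler & 0xffff, b = adler >> 16;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

static size_t zlib_store(unsigned char *out, const unsigned char *in,
			 size_t len)
{
	size_t o = 0;
	uint32_t adler = 1;

	out[o++] = 0x78;	/* CMF: deflate, 32KB window */
	out[o++] = 0x01;	/* FLG: no dict; (CMF*256+FLG) % 31 == 0 */
	do {
		size_t chunk = len > 65535 ? 65535 : len;

		out[o++] = (chunk == len);	/* BFINAL, BTYPE=00 (stored) */
		out[o++] = chunk & 0xff;	/* LEN, little-endian */
		out[o++] = (chunk >> 8) & 0xff;
		out[o++] = ~chunk & 0xff;	/* NLEN = ~LEN */
		out[o++] = (~chunk >> 8) & 0xff;
		memcpy(out + o, in, chunk);
		adler = adler32_update(adler, in, chunk);
		o += chunk;
		in += chunk;
		len -= chunk;
	} while (len);
	out[o++] = adler >> 24;	/* Adler-32 trailer, big-endian */
	out[o++] = (adler >> 16) & 0xff;
	out[o++] = (adler >> 8) & 0xff;
	out[o++] = adler & 0xff;
	return o;
}
```

The output is a valid zlib stream that any inflater accepts, so the synthesized objects pass Git's consistency checks while costing only a memcpy per 64KB.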
The shift overflow bug in index-pack and unpack-objects caused incorrect
object size calculation when the encoded size required more than 32 bits
of shift. This would result in corrupted or failed unpacking of objects
larger than 4GB.

Add a test that creates a pack file containing a 4GB+ blob using the
new 'test-tool synthesize pack --reachable-large' command, then clones
the repository to verify the fix works correctly.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho marked this pull request as ready for review April 28, 2026 13:02
Member Author

dscho commented Apr 28, 2026

/submit


gitgitgadget Bot commented Apr 28, 2026

Submitted as pull.2102.git.1777393580.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2102/dscho/fix-large-clones-on-windows-v1

To fetch this version to local tag pr-2102/dscho/fix-large-clones-on-windows-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2102/dscho/fix-large-clones-on-windows-v1

Comment thread delta.h
@@ -86,8 +86,11 @@ void *patch_delta(const void *src_buf, unsigned long src_size,
* This must be called twice on the delta data buffer, first to get the

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > The delta header decoding functions return unsigned long, which
> truncates on Windows for objects larger than 4GB. Introduce size_t
> variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
> preserve the full 64-bit size, and use them in packed_object_info()
> where the size is needed for streaming decisions.

> + * Size_t variant that doesn't truncate - use for >4GB objects on Windows.
> + */
> +static inline size_t get_delta_hdr_size_sz(const unsigned char **datap,
> +					   const unsigned char *top)
...
> +static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
> +					       const unsigned char *top)
> +{
> +	size_t size = get_delta_hdr_size_sz(datap, top);
>   	return cast_size_t_to_ulong(size);
>   }

I like this trick to use the 64-bit implementation and only to
down-cast for API compatibility. This allows a more gradual
transition than if we replaced ulongs with size_ts everywhere
at once.

Thanks,
-Stolee


gitgitgadget Bot commented Apr 29, 2026

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

Comment thread t/t5608-clone-2gb.sh
@@ -49,4 +49,41 @@ test_expect_success 'clone - with worktree, file:// protocol' '


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > The shift overflow bug in index-pack and unpack-objects caused incorrect
> object size calculation when the encoded size required more than 32 bits
> of shift. This would result in corrupted or failed unpacking of objects
> larger than 4GB.
> > Add a test that creates a pack file containing a 4GB+ blob using the
> new 'test-tool synthesize pack --reachable-large' command, then clones
> the repository to verify the fix works correctly.

As mentioned in the previous patch, constructing this large packfile
takes ~4 minutes in CI pipelines. That's quite a lot to handle for
every CI run.

> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '

Your prereq here prevents it from running on 32-bit builds, which is
good. However, I wonder if it would be worth also specifying these
tests as expensive. It's less likely that these layers will be touched
often, so it should be enough to run these on major occasions, such as
testing a release candidate.

(I also think it's appropriate to have these tests _not_ be marked
expensive in their original contribution to git-for-windows/git,
because the pull request build should prove that the tests work. And
maybe git-for-windows/git should keep them for every PR since that's
where the tests matter the most.)

I suppose this also is a question for Junio and our process for
validating releases. Do we have a certain cadence where we run the
expensive tests? What has been our threshold for hiding a test case
behind the expensive label?

Thanks,
-Stolee


Jeff King wrote on the Git mailing list (how to reply to this email):

On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:

> As mentioned in the previous patch, constructing this large packfile
> takes ~4 minutes in CI pipelines. That's quite a lot to handle for
> every CI run.

And for local runs, too. ;) The test suite takes less than 90 seconds to
run on my laptop, but t5608 by itself takes 160 seconds. Even if running it in
parallel didn't slow down the rest of the suite (which is not true,
because it's obviously hogging a whole processor the whole time), that's
still almost doubling the run-time.

I'd also worry about assuming that the trash directory can hold 4+GB
(maybe 8GB+ since we clone it?), especially since many of us use ram
disks.

That said...

> > +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
> 
> Your prereq here prevents it from running on 32-bit builds, which is
> good. However, I wonder if it would be worth also specifying these
> tests as expensive. It's less likely that these layers will be touched
> often, so it should be enough to run these on major occasions, such as
> testing a release candidate.

I think it is already skipped in most cases, because t5608 requires the
GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
be using EXPENSIVE, too, as I do not think there is much value in having
individual flags for all of the expensive tests. I think that test just
predates the modern prereq system entirely.

> I suppose this also is a question for Junio and our process for
> validating releases. Do we have a certain cadence where we run the
> expensive tests? What has been our threshold for hiding a test case
> behind the expensive label?

AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
systematically running them. Likewise for the t/perf tests, which are
super expensive but do (very occasionally) turn up interesting
regressions.

-Peff


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 5/1/2026 2:38 AM, Jeff King wrote:
> On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:

>>> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
>>
>> Your prereq here prevents it from running on 32-bit builds, which is
>> good. However, I wonder if it would be worth also specifying these
>> tests as expensive. It's less likely that these layers will be touched
>> often, so it should be enough to run these on major occasions, such as
>> testing a release candidate.
> 
> I think it is already skipped in most cases, because t5608 requires the
> GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
> be using EXPENSIVE, too, as I do not think there is much value in having
> individual flags for all of the expensive tests. I think that test just
> predates the modern prereq system entirely.

Thanks for the extra details here! That helps avoid the issues that I
was thinking about, but maybe doubling-down and adding EXPENSIVE is
still worth it.

>> I suppose this also is a question for Junio and our process for
>> validating releases. Do we have a certain cadence where we run the
>> expensive tests? What has been our threshold for hiding a test case
>> behind the expensive label?
> 
> AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
> systematically running them. Likewise for the t/perf tests, which are
> super expensive but do (very occasionally) turn up interesting
> regressions.

I used to be more diligent about running the performance tests myself
around release windows. The EXPENSIVE tests would also be good to do
on rc0. I will contemplate how to put this into my routine.

Thanks,
-Stolee


gitgitgadget Bot commented Apr 29, 2026

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> On Windows, unsigned long is 32-bit even on 64-bit systems. This causes
> multiple problems when Git handles objects larger than 4GB. This patch
> series is a very targeted fix for a very early part of the problem: it
> addresses the most fundamental truncation points that prevent a >4GB object
> from surviving a clone at all.
> > Specifically, this fixes:
> >   * zlib's uLong wrapping and triggering BUG() assertions in the git_zstream
>     wrapper
>   * Object sizes being truncated in pack streaming, delta headers, and
>     index-pack/unpack-objects
>   * pack-objects re-encoding reused pack entries with a truncated size,
>     producing corrupt packs on the wire

I'm glad to see this progress in this direction. It's a big step!

> Many other code paths still use unsigned long for object sizes (e.g.,
> cat-file -s, object_info.sizep, the delta machinery) and will need their own
> conversions. This series does not attempt to fix those.

I appreciate the mechanisms used to keep the scope of change minimal. This
differs from the typical "replace all 'unsigned long's with 'size_t'"
proposals in some clever ways.

> Based on work by @LordKiRon in git-for-windows/git#6076.
> > The last two commits add a test helper that synthesizes a pack with a >4GB
> blob and regression tests that clone it via both the unpack-objects and
> index-pack code paths using file:// transport.

My biggest concern here is about how expensive this test is. I wonder if
we should mark it as expensive for the core project and leave it enabled by
default in git-for-windows/git.

Thanks,
-Stolee

Comment thread builtin/index-pack.c
@@ -37,7 +37,7 @@ static const char index_pack_usage[] =


Torsten Bögershausen wrote on the Git mailing list (how to reply to this email):

On Tue, Apr 28, 2026 at 04:26:15PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> When unpacking objects from a packfile, the object size is decoded
> from a variable-length encoding. On platforms where unsigned long is
> 32-bit (such as Windows, even in 64-bit builds), the shift operation
> overflows when decoding sizes larger than 4GB. The result is a
> truncated size value, causing the unpacked object to be corrupted or
> rejected.
> 
> Fix this by changing the size variable to size_t, which is 64-bit on
> 64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
> space.
> 
> This was originally authored by LordKiRon <https://github.com/LordKiRon>,
> who preferred not to reveal their real name and therefore agreed that I
> take over authorship.

Good to see things moving forward.

See even
https://github.com/git-for-windows/git/pull/2179
which is probably obsolete soon.


gitgitgadget Bot commented Apr 30, 2026

User Torsten Bögershausen <tboegi@web.de> has been added to the cc: list.


gitgitgadget Bot commented May 1, 2026

User Jeff King <peff@peff.net> has been added to the cc: list.
